Skip GCS upload when product content is unchanged #88

Merged: jirhiker merged 2 commits into main from feature/gcs-content-dedup on Jun 28, 2026
@jirhiker

What

Avoid duplicating datasets in GCS. Previously, every run wrote a new dated archive ({date}.geojson) and overwrote latest.geojson even when the data hadn't changed. Now the upload is skipped when the content matches what is already stored.

How

  • _content_hash(local_path) — SHA-256 of the GeoJSON ignoring the volatile timeStamp, so an unchanged dataset hashes identically across runs (features + the rest of the collection are included).
  • upload_product stores the hash in the blob metadata (content_hash) on the dated blob; copy_blob carries it to latest.geojson. On the next run it reads latest's stored hash and, on a match, skips writing — returns dated_uri=None, skipped=True. Otherwise it uploads as before.
  • The combine asset surfaces skipped_unchanged in MaterializeResult metadata and omits dated_uri when skipped.
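The flow in the three bullets above can be sketched roughly as follows. This is an illustration under assumptions, not the code merged in this PR: the argument list of upload_product, the UploadResult dataclass, the combine_metadata helper, and the {prefix}/latest.geojson layout are all hypothetical. The bucket is duck-typed after the google-cloud-storage Bucket/Blob interface (get_blob, blob, copy_blob, upload_from_filename, metadata), and FakeBucket/FakeBlob are in-memory stand-ins so the sketch runs without GCS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UploadResult:  # hypothetical return type
    dated_uri: Optional[str]  # None when the upload was skipped
    skipped: bool


def upload_product(bucket, local_path, prefix, date_str, new_hash):
    """Upload unless latest.geojson already stores the same content_hash."""
    latest_name = f"{prefix}/latest.geojson"
    latest = bucket.get_blob(latest_name)  # None if the blob does not exist
    stored = None
    if latest is not None:
        stored = (latest.metadata or {}).get("content_hash")
    if stored is not None and stored == new_hash:
        return UploadResult(dated_uri=None, skipped=True)

    dated_name = f"{prefix}/{date_str}.geojson"
    dated = bucket.blob(dated_name)
    dated.metadata = {"content_hash": new_hash}  # sent along with the upload
    dated.upload_from_filename(local_path)
    bucket.copy_blob(dated, bucket, latest_name)  # the copy keeps the metadata
    return UploadResult(dated_uri=f"gs://{bucket.name}/{dated_name}", skipped=False)


def combine_metadata(result):
    """Metadata for the asset result; dated_uri is omitted when skipped."""
    metadata = {"skipped_unchanged": result.skipped}
    if not result.skipped:
        metadata["dated_uri"] = result.dated_uri
    return metadata


class FakeBlob:  # in-memory stand-in for a GCS blob (demo only)
    def __init__(self, bucket, name):
        self.bucket, self.name, self.metadata = bucket, name, None

    def upload_from_filename(self, filename):
        self.bucket.blobs[self.name] = self


class FakeBucket:  # in-memory stand-in for a GCS bucket (demo only)
    name = "demo-bucket"

    def __init__(self):
        self.blobs = {}

    def get_blob(self, name):
        return self.blobs.get(name)

    def blob(self, name):
        return FakeBlob(self, name)

    def copy_blob(self, blob, destination_bucket, new_name):
        copy = FakeBlob(destination_bucket, new_name)
        copy.metadata = dict(blob.metadata or {})
        destination_bucket.blobs[new_name] = copy
        return copy


bucket = FakeBucket()
first = upload_product(bucket, "combined.geojson", "products/demo", "2026-06-27", "abc123")
second = upload_product(bucket, "combined.geojson", "products/demo", "2026-06-28", "abc123")
print(combine_metadata(first))
print(combine_metadata(second))
```

In this sketch the first call uploads and seeds the hash, and the second call, with the same hash, writes nothing. In a Dagster asset the dict from combine_metadata would presumably be passed as the metadata argument of MaterializeResult.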

Why hash-minus-timeStamp

The collection embeds "timeStamp": <now> every run, so a raw file/object hash would always differ. Hashing the content with timeStamp removed makes "unchanged data" detectable.
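A minimal sketch of the idea, assuming timeStamp is a top-level key of the FeatureCollection and that a canonical JSON serialization (sorted keys, compact separators) is an acceptable basis for the hash. The helper names hash_collection and content_hash are illustrative; the PR's actual _content_hash may differ in detail.

```python
import hashlib
import json


def hash_collection(collection):
    """SHA-256 of a GeoJSON collection with the volatile timeStamp removed."""
    # Assumption: timeStamp lives at the top level of the collection.
    stable = {k: v for k, v in collection.items() if k != "timeStamp"}
    # Sorted keys and fixed separators make the serialization deterministic.
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def content_hash(local_path):
    """Hash the GeoJSON file at local_path, ignoring timeStamp."""
    with open(local_path, encoding="utf-8") as f:
        return hash_collection(json.load(f))
```

Two collections that differ only in timeStamp hash identically, while any change to features or to the other collection members changes the digest.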

Verification

  • Hash stable across differing timeStamp; changes when features change.
  • Skip path: nothing written (upload_from_filename/copy_blob not called), skipped=True, dated_uri=None.
  • Upload path: metadata set, dated written, copied to latest.
  • Definitions load.

Note

First run after deploy always uploads (no stored hash yet), seeding the metadata for subsequent dedup.

🤖 Generated with Claude Code

Every run wrote a new dated archive + overwrote latest.geojson even when
the data was identical, duplicating datasets in GCS. Dedup by content
hash:

- _content_hash hashes the GeoJSON ignoring the volatile timeStamp, so an
  unchanged dataset hashes the same across runs.
- upload_product stores the hash on the dated/latest blob metadata; on the
  next run it compares against latest.geojson's stored hash and, on a
  match, skips writing entirely (dated_uri=None, skipped=True).
- combine asset surfaces skipped_unchanged in MaterializeResult metadata
  and omits dated_uri when skipped.

Verified: hash is stable across timeStamp, changes with features; skip
path writes nothing, upload path sets metadata + copies to latest.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 27, 2026


Your pull request is automatically being deployed to Dagster Cloud.

Location: die-orchestration | Link: View in Cloud | Updated: Jun 28, 2026 at 04:43 AM (UTC)

Persist last_changed (YYYY-MM-DD the content actually changed) in the
GCS blob metadata alongside content_hash, and report days_since_last_change
on every run. On a skipped (unchanged) run the last_changed date is
carried forward and the day count grows; on a real change it resets to 0.

Surfaced in the combine asset's MaterializeResult metadata
(last_changed, days_since_last_change) so a product whose data has been
static for, say, 60+ days is an obvious candidate to relax from a daily
to a monthly schedule.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
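The last_changed bookkeeping described in the second commit could look roughly like this. It is a sketch under assumptions: the function name change_tracking and its parameters are hypothetical, and the stored value is assumed to be the YYYY-MM-DD string read back from the blob metadata.

```python
from datetime import date
from typing import Optional


def change_tracking(stored_last_changed: Optional[str], changed: bool, today: date) -> dict:
    """Carry last_changed forward on unchanged runs; reset it on a real change."""
    if changed or not stored_last_changed:
        # Real change, or no stored date yet (first run after deploy).
        last_changed = today
    else:
        last_changed = date.fromisoformat(stored_last_changed)
    return {
        "last_changed": last_changed.isoformat(),
        "days_since_last_change": (today - last_changed).days,
    }


# Unchanged run: the stored date is carried forward and the count grows.
print(change_tracking("2026-06-01", changed=False, today=date(2026, 6, 28)))
# Changed run: the date resets to today and the count returns to 0.
print(change_tracking("2026-06-01", changed=True, today=date(2026, 6, 28)))
```

A product whose days_since_last_change stays high across many runs is the kind of candidate the commit message suggests moving to a less frequent schedule.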
@jirhiker jirhiker merged commit d286019 into main Jun 28, 2026
2 checks passed
@jirhiker jirhiker deleted the feature/gcs-content-dedup branch June 28, 2026 04:43